SUMMARY
The  chi-square  probability  plot  of  the  ordered  generalized 
distances  from  the  sample  mean  vector  has  been  suggested  for  use  in 
checking  the  normal  assumption  for  a  given  body  of  multivariate 
observations.  Here,  the  results  of  an  empirical  study  are  presented 
to  illustrate  the  small  and  large  sample  behavior  of  this  plot  for 
samples  from  both  normal  and  non-normal  populations.  Our  sample 
sizes  range  from  10  to  200.   As  might  be  expected,  samples  of  10 
tell  us  almost  nothing  about  multivariate  normality,  whereas  samples 
of  200  seem  very  stable  except  for  their  few  highest  points.  In 
general,  for  samples  of  size  30  or  more,  the  chi-square  plot  appears 
to  be  an  effective  graphical  aid  for  assessing  multivariate  normality. 

Keywords:   Multivariate  Normality;  Generalized  Distances;  Chi-Square 
Probability  Plot;   Computer  Simulations 


1.   INTRODUCTION 

Since  the  assumption  of  multivariate  normality  underlies 
much  of  the  classical  multivariate  statistical  methodology,  it 
would  be  useful  to  have  procedures  for  checking  the  validity  of  the 
normal  assumption  for  a  given  set  of  multivariate  data.  As  pointed 
out by Gnanadesikan (1977) and Everitt (1978), such checks would
be  helpful  in  guiding  the  subsequent  analysis  of  the  data,  perhaps 
by  suggesting  the  need  for  and  nature  of  a  transformation  of  the 
data  to  make  them  more  nearly  normally  distributed,  or  perhaps  by 
indicating  appropriate  modifications  of  the  models  and  methods  for 
analyzing the data. Numerical methods for testing joint normality
have been developed by Mardia (1970, 1975), Malkovich and Afifi (1973),
and Andrews et al. (1971, 1972, 1973); a detailed summary of these
techniques is given in Gnanadesikan (1977).

Healy  (1958)  first  proposed  an  extension  of  the  normal 
probability  plot  to  assess  multivariate  normality  graphically.  His 
method uses the generalized distance, $d_i$, of each point from the
sample mean vector, where

$$ d_i = (\mathbf{x}_i - \bar{\mathbf{x}})^T S^{-1} (\mathbf{x}_i - \bar{\mathbf{x}}) $$

and $\mathbf{x}_i$ is the ith observation vector, $\bar{\mathbf{x}}$ is the p-variate sample mean
vector, and $S$ is the sample variance-covariance matrix. If the sample
data  are  from  a  multivariate  normal  distribution,  then  these  distances 
have  approximately  a  chi-square  distribution  with  p  degrees  of  freedom. 


(The exact marginal distribution of $d_i$ is known to be a constant
multiple of a beta rather than a chi-square distribution; see
Gnanadesikan and Kettenring, 1972.) This distributional property
suggests that a chi-square probability plot of the ordered
generalized distances may be useful for assessing joint normality.
Specifically, the n generalized distances, $d_i$ (i = 1, ..., n), are
ordered in magnitude, and the ith ordered value is plotted
against the quantile of a chi-square distribution with p degrees of
freedom corresponding to a cumulative probability of (i - 0.5)/n, for
i = 1, ..., n. For multivariate normal data, the resulting plot should
resemble a straight line passing through the origin; departures from
normality will be indicated by departures from linearity in this plot.
In 1973, Andrews et al. suggested a graphical procedure that uses a
generalized distances-and-angles representation of multivariate
data, a procedure which encompasses Healy's technique.
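
The plotting procedure described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the routine used in the study: the function names are hypothetical, the sample covariance matrix is assumed to use the divisor n - 1, and the chi-square quantiles are obtained by bisection on a series expansion of the chi-square distribution function.

```python
import math


def chi2_cdf(x, p):
    """Chi-square CDF with p degrees of freedom, via the power series for
    the regularized lower incomplete gamma function P(p/2, x/2)."""
    if x <= 0.0:
        return 0.0
    a, z = p / 2.0, x / 2.0
    term = total = 1.0 / a
    k = 1
    while term > 1e-16 * total:
        term *= z / (a + k)
        total += term
        k += 1
    return total * math.exp(a * math.log(z) - z - math.lgamma(a))


def chi2_quantile(prob, p):
    """Invert chi2_cdf by bisection (requires 0 < prob < 1)."""
    lo, hi = 0.0, 1.0
    while chi2_cdf(hi, p) < prob:
        hi *= 2.0
    for _ in range(80):
        mid = 0.5 * (lo + hi)
        if chi2_cdf(mid, p) < prob:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)


def solve(matrix, rhs):
    """Solve matrix * y = rhs by Gaussian elimination with partial pivoting."""
    n = len(rhs)
    m = [row[:] + [rhs[i]] for i, row in enumerate(matrix)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    y = [0.0] * n
    for r in range(n - 1, -1, -1):
        y[r] = (m[r][n] - sum(m[r][c] * y[c] for c in range(r + 1, n))) / m[r][r]
    return y


def generalized_distances(data):
    """d_i = (x_i - xbar)^T S^{-1} (x_i - xbar); S uses divisor n - 1 (assumed)."""
    n, p = len(data), len(data[0])
    xbar = [sum(row[j] for row in data) / n for j in range(p)]
    dev = [[row[j] - xbar[j] for j in range(p)] for row in data]
    s = [[sum(d[j] * d[k] for d in dev) / (n - 1) for k in range(p)]
         for j in range(p)]
    return [sum(u * w for u, w in zip(d, solve(s, d))) for d in dev]


def chi_square_plot_points(data):
    """(chi-square quantile, ordered distance) pairs for the probability plot."""
    n, p = len(data), len(data[0])
    ordered = sorted(generalized_distances(data))
    return [(chi2_quantile((i - 0.5) / n, p), ordered[i - 1])
            for i in range(1, n + 1)]
```

Calling `chi_square_plot_points(data)` on a list of n observation rows returns the n coordinate pairs; plotting the second coordinate against the first and comparing with a line of unit slope through the origin gives the display discussed here.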

Like  other  probability  plotting  procedures,  the  proposed 
chi-square  probability  plot  can  be  considered  as  an  informal  but 
informative  graphical  aid  for  data  analysis.  However,  unless  the  users 
acquire  some  feeling  for  the  behavior  of  this  plot  for  samples  from 
both normal and non-normal distributions, its effectiveness would be
limited. Some examples illustrating the usefulness of the chi-square
probability plot for assessing joint normality are given in Gnanadesikan
(1977) and Everitt (1978), but no systematic study has been done to
examine the small and large sample behavior of this plot for samples
from both normal and non-normal populations.

In  this  study,  we  use  computer  simulated  random  samples  to 
examine  the  behavior  of  the  chi-square  probability  plot  for  both  the 
null case (normal population) and the non-null case (non-normal population).
In the null case, we generated samples of size n = 10, 20, 30, 50, and
200 from 2- and 5-variate normal distributions, and obtained the
corresponding  chi-square  probability  plots.   These  generated  plots 
provide  a  reference  for  judging  the  normality  of  a  given  sample;  they 
are  given  in  Section  2. 

The  results  obtained  for  the  non-null  case  are  given  in 
Section  3.  There,  generated  plots  are  produced  to  illustrate  the 
power  of  the  chi-square  probability  plot  when  samples  are  drawn  from 
2- and 5-variate non-normal distributions (with independently distributed
variates), including chi-square (10 degrees of freedom), 10%
contaminated normal, and Cauchy distributions. Concluding remarks are
given  in  the  final  section. 


2.   CHI-SQUARE  PROBABILITY  PLOTS  FOR  NORMAL  DATA 

All  the  probability  plots  presented  in  this  paper  were 
produced  on  an  IBM  370/168  computer,  and  the  plotting  routine  used 
is  a  modification  of  Spark's  (1971)  algorithm.  In  each  of  the  plots, 
a  line  with  unit  slope  which  passes  through  the  origin  is  plotted  for 
reference purposes; moreover, multiple points are represented by the
letter "O" and single points by the asterisk "*". The required
chi-square quantiles were computed using the methods of Kuo (1965)
and Goldstein (1973).

The random number generator used is the one given in Lewis
et al. (1969). In the null case, normal deviates were computed from
the generated random numbers using Cunningham's (1969) algorithm.
For each of the two dimensions, p = 2 and 5, four chi-square
probability plots were produced for each of the sample sizes n = 10, 20,
30, 50, and 200; half of the normal samples were drawn from MVN(0, I),
where I is the identity matrix of rank p, and half were drawn from
MVN(0, Σ), where Σ is a randomly generated positive-definite
matrix. The plots are given in Figures 1A - 5A (p = 2) and
Figures 1B - 5B (p = 5) for the five sample sizes n = 10, 20,
30, 50, and 200, respectively. As might be expected, samples of 10 tell
us almost nothing about normality, whereas samples of 200 seem very
stable except for their few highest points. Samples of 20 show wild
fluctuations; samples of 30 are better behaved; and samples of 50
nearly always appear linear but fluctuate at their upper ends.
Neither the population covariance structure nor the number of
dimensions appears to affect the behavior of the plot.
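
The null-case sampling scheme can be imitated with modern tools. The sketch below is an assumption-laden stand-in, not the original procedure: it uses Python's built-in generator in place of the Lewis et al. (1969) generator and Cunningham's (1969) algorithm, and it draws the positive-definite matrix as A A^T + I, which is only one hypothetical way to obtain "some randomly generated" covariance matrix. All function names are illustrative.

```python
import math
import random


def random_positive_definite(p, rng):
    """Hypothetical construction: A A^T + I with A filled by N(0, 1) draws."""
    a = [[rng.gauss(0.0, 1.0) for _ in range(p)] for _ in range(p)]
    return [[sum(a[i][k] * a[j][k] for k in range(p)) + (1.0 if i == j else 0.0)
             for j in range(p)] for i in range(p)]


def cholesky(sigma):
    """Lower-triangular L with L L^T = sigma (sigma must be positive definite)."""
    p = len(sigma)
    low = [[0.0] * p for _ in range(p)]
    for i in range(p):
        for j in range(i + 1):
            s = sigma[i][j] - sum(low[i][k] * low[j][k] for k in range(j))
            low[i][j] = math.sqrt(s) if i == j else s / low[j][j]
    return low


def mvn_sample(n, sigma, rng):
    """n draws from MVN(0, sigma), each computed as L z with z standard normal."""
    p = len(sigma)
    low = cholesky(sigma)
    sample = []
    for _ in range(n):
        z = [rng.gauss(0.0, 1.0) for _ in range(p)]
        sample.append([sum(low[i][k] * z[k] for k in range(i + 1))
                       for i in range(p)])
    return sample
```

For example, `mvn_sample(30, random_positive_definite(5, rng), rng)` with `rng = random.Random(1)` gives one 5-variate sample of size 30; passing the identity matrix as `sigma` gives the MVN(0, I) case.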

(Insert Figures 1A-B to 5A-B here.)


3.   CHI-SQUARE  PROBABILITY  PLOTS  FOR  NON-NORMAL  DATA 

In  order  to  examine  the  power  of  the  chi-square  probability 
plot,  we  repeated  the  simulation  study  using  samples  from  non-normal 
distributions.

3.1  10%  Contaminated  Multivariate  Normal  Distribution 

Here the generated data are from 90% MVN(0, I) + 10% MVN(0, 10I).
For each of the two dimensions, p = 2 and 5, two chi-square
probability plots were produced for each of the sample sizes n = 10, 30, and
200. At n = 200, the chi-square plots clearly indicate non-normality
(see Figure 6); at n = 30, there is some evidence that the distribution
is not normal (see Figure 7); but at n = 10, no indication of contamination
can be observed (see Figure 8). Thus, as expected, a large sample is
needed to detect contamination.
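
A contaminated sample of this kind might be generated as follows. The sketch assumes that the contaminating component has covariance 10 I (exposed as the `scale` parameter so that another reading can be substituted), that each observation is assigned to a component independently, and it uses Python's built-in generator rather than the one used in the study; the function name is hypothetical.

```python
import math
import random


def contaminated_normal_sample(n, p, rng, eps=0.10, scale=10.0):
    """Each observation is MVN(0, I) with probability 1 - eps and
    MVN(0, scale * I) with probability eps (scale = 10 is an assumption)."""
    sample = []
    for _ in range(n):
        sd = math.sqrt(scale) if rng.random() < eps else 1.0
        sample.append([rng.gauss(0.0, sd) for _ in range(p)])
    return sample
```

With n = 10 and a 10% contamination rate, a sample contains on average a single contaminated observation and frequently none at all, which is consistent with the plots at n = 10 showing no indication of contamination.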

(Insert Figures 6 to 8 here.)

3.2  Multivariate  Chi-Square  Distribution 

The generated data consist of p independent (p = 2 and 5)
variates, each of which has a chi-square (10) distribution. This is a
mildly skewed distribution, and at n = 200 there is again a clear
indication of non-normality (see Figure 9); but for smaller samples
(n = 10 and 30), we find it hard to distinguish the plots given in
Figures 10 and 11 from those given in Figures 1A-B and 3A-B, respectively.
Hence, a rather large sample is needed to detect the moderate skewness
in the chi-square population, and it should be noted that the chi-square
probability plots obtained from a "skewed" distribution are convex.
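
To make "mildly skewed" concrete: a chi-square variate with k degrees of freedom has skewness sqrt(8/k), about 0.89 for k = 10. The sketch below generates such data using the fact that a chi-square(k) variate is a gamma variate with shape k/2 and scale 2. The function names are hypothetical, and Python's built-in generator stands in for the one used in the study.

```python
import math
import random


def chi_square_sample(n, p, rng, df=10):
    """n observations of p independent chi-square(df) variates, drawn as
    gamma variates with shape df / 2 and scale 2."""
    return [[rng.gammavariate(df / 2.0, 2.0) for _ in range(p)]
            for _ in range(n)]


def chi_square_skewness(df):
    """Theoretical skewness of a chi-square(df) distribution: sqrt(8 / df)."""
    return math.sqrt(8.0 / df)
```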

(Insert  Figures  9  to  11  here.) 

3.3  Multivariate  Cauchy  Distribution

This is an extremely heavy-tailed distribution. Even at n = 10,
the probability plots are far from linear (see Figure 12). Apparently,
the chi-square probability plot has good power for rejecting normality
when the true distribution is heavy-tailed. The plots for the sample
sizes n = 20 and 100 are given in Figures 13 and 14, respectively.
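
Heavy-tailed data of this kind might be generated as below. The sketch assumes that the p variates are independent standard Cauchy variables, following the Introduction's description of independently distributed variates; a different multivariate Cauchy construction would need different code. The inverse-CDF transform tan(pi (U - 1/2)) of a uniform variate U is used, and the function name is hypothetical.

```python
import math
import random


def cauchy_sample(n, p, rng):
    """n observations of p independent standard Cauchy variates (assumed
    structure), each obtained by the inverse-CDF transform of a uniform draw."""
    return [[math.tan(math.pi * (rng.random() - 0.5)) for _ in range(p)]
            for _ in range(n)]
```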

(Insert  Figures  12  to  14  here.) 

3.4  Multivariate  Normal  Distribution  with  an  Outlier 

Since the most common type of non-normality is a single oversized
outlier, it would be interesting to find out whether the chi-square
probability plot can detect multivariate outliers. Realizing that there
may be various types of multivariate outliers (see Gnanadesikan and
Kettenring, 1972), we confine our effort to studying outliers in one
particular dimension. In each of the plots given in Figure 15, one
observation has been contaminated by adding 10 standard deviations to
the first variate. For samples of 30 and more, the chi-square plots
consistently identify the outlier; however, the plots are less
effective for small samples.
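
One possible contributing reason for the weaker performance in small samples is algebraic and is not a result stated in this study: when S is computed with the divisor n - 1, no generalized distance can exceed (n - 1)^2 / n, which is 8.1 for n = 10 but about 28 for n = 30. The bivariate sketch below contaminates one observation and lets the bound be checked; the function names are hypothetical, and unit standard deviations (as under MVN(0, I)) are assumed, so that the shift of 10 corresponds to 10 standard deviations.

```python
import random


def add_outlier(data, index=0, shift=10.0):
    """Copy of data with `shift` added to the first variate of one observation."""
    out = [row[:] for row in data]
    out[index][0] += shift
    return out


def distances_2d(data):
    """Generalized distances for bivariate data, using the closed-form
    inverse of the 2 x 2 sample covariance matrix (divisor n - 1)."""
    n = len(data)
    mx = sum(r[0] for r in data) / n
    my = sum(r[1] for r in data) / n
    sxx = sum((r[0] - mx) ** 2 for r in data) / (n - 1)
    syy = sum((r[1] - my) ** 2 for r in data) / (n - 1)
    sxy = sum((r[0] - mx) * (r[1] - my) for r in data) / (n - 1)
    det = sxx * syy - sxy ** 2
    return [(syy * (r[0] - mx) ** 2
             - 2.0 * sxy * (r[0] - mx) * (r[1] - my)
             + sxx * (r[1] - my) ** 2) / det for r in data]


def distance_upper_bound(n):
    """Largest value any generalized distance can take in a sample of size n."""
    return (n - 1) ** 2 / n
```

Applying `distances_2d(add_outlier(sample))` to a clean bivariate normal sample and comparing the largest distance with `distance_upper_bound(n)` shows how little room the upper end of the plot has at n = 10.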

(Insert  Figure  15  here.) 


4.   DISCUSSION 

The object of this simulation study is to illustrate the
usefulness of the chi-square probability plot for assessing
multivariate normality. The results indicate that the minimum sample
size needed to detect non-normality depends on the degree to which the
underlying distribution is "non-normal", but a sample size of 30 is
usually sufficient to detect all but slight deviations from normality.
For normal samples, chi-square plots for various sample sizes were
generated to provide a reference for the "expected" behavior. However,
it should be pointed out that it is only through routine usage that
a user can become more adept at interpreting the chi-square plots.